Project: Ford Go Bike Data Analysis

by Aly Reda

Table of Contents

Introduction

This project analyzes Ford Go Bike rides recorded in February 2019 across California, especially in San Francisco.


What is the structure of your dataset?

The original Bay Wheels trip dataset contains 183,412 rides and 16 features. I added 12 more features for analysis purposes (numbers 17 to 28 in the list below).

  1. duration_sec
  2. start_time
  3. end_time
  4. start_station_id
  5. start_station_name
  6. start_station_latitude
  7. start_station_longitude
  8. end_station_id
  9. end_station_name
  10. end_station_latitude
  11. end_station_longitude
  12. bike_id
  13. user_type
  14. member_birth_year
  15. member_gender
  16. bike_share_for_all_trip
  17. duration_hr
  18. distance_km
  19. speed_km/hr
  20. member_age
  21. age_group
  22. start_hour
  23. end_hour
  24. day
  25. weekday
  26. month
  27. quarter
  28. season

What is/are the main feature(s) of interest in your dataset?

  1. Duration, Distance, Speed
  2. Age, Gender, Membership Status
  3. Month, Weekday, Hour, Season, Quarter

What features in the dataset do you think will help support your investigation into your feature(s) of interest?

The following columns were added to support the analysis:

  1. duration_hr
  2. distance_km
  3. speed_km/hr
  4. member_age
  5. age_group
  6. start_hour
  7. end_hour
  8. day
  9. weekday
  10. month
  11. quarter
  12. season
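As a rough illustration, the calendar and age columns in this list could be derived with pandas along the following lines. This is a minimal sketch, not the notebook's actual code: the function name `add_time_and_age_columns`, the `SEASON_BY_MONTH` mapping, the decade-style age labels, and the sample rows are all assumptions made for the example.

```python
import pandas as pd

# Assumed month-to-season mapping (northern hemisphere, meteorological seasons).
SEASON_BY_MONTH = {
    12: "Winter", 1: "Winter", 2: "Winter",
    3: "Spring", 4: "Spring", 5: "Spring",
    6: "Summer", 7: "Summer", 8: "Summer",
    9: "Fall", 10: "Fall", 11: "Fall",
}


def add_time_and_age_columns(df, data_year=2019):
    """Return a copy of df with calendar and age columns added."""
    out = df.copy()
    start = pd.to_datetime(out["start_time"])
    end = pd.to_datetime(out["end_time"])

    # Calendar parts taken from the trip start (and end, for end_hour).
    out["start_hour"] = start.dt.hour
    out["end_hour"] = end.dt.hour
    out["day"] = start.dt.day
    out["weekday"] = start.dt.day_name()
    out["month"] = start.dt.month
    out["quarter"] = start.dt.quarter
    out["season"] = out["month"].map(SEASON_BY_MONTH)

    # Age relative to the year of the data; birth year is cast to int first.
    out["member_birth_year"] = out["member_birth_year"].astype(int)
    out["member_age"] = data_year - out["member_birth_year"]
    # Decade labels such as "20s" and "30s" (assumed labelling scheme).
    out["age_group"] = (out["member_age"] // 10 * 10).astype(str) + "s"
    return out


# Two made-up rows standing in for the real trip table.
sample = pd.DataFrame({
    "start_time": ["2019-02-28 17:32:10", "2019-02-03 08:05:00"],
    "end_time": ["2019-02-28 17:45:00", "2019-02-03 09:01:30"],
    "member_birth_year": [1987.0, 1995.0],
})
featured = add_time_and_age_columns(sample)
print(featured[["start_hour", "weekday", "season", "member_age", "age_group"]])
```

Returning a copy keeps the raw table untouched, which makes it easier to rerun the cleaning steps from the start.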

1. Data Wrangling

1.a Import Packages

1.b Functions

1.c Dataset Exploration

1.d Dataset Cleaning

2. Univariate Exploration

2.a Duration Outlier Check

99% of rides last less than an hour.

Rides longer than 1.25 hours are removed (the user probably forgot to end the trip).
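The duration check above can be sketched as a quantile lookup followed by a filter. The sample durations and the names `rides` and `MAX_DURATION_HR` are invented for illustration; only the 1.25 hour cutoff comes from the text.

```python
import pandas as pd

# Made-up durations in seconds; the last one mimics a trip that was never ended.
rides = pd.DataFrame({"duration_sec": [300, 420, 540, 600, 900, 1500, 3000, 86000]})
rides["duration_hr"] = rides["duration_sec"] / 3600

# Inspect the upper tail before choosing a cutoff.
p99 = rides["duration_hr"].quantile(0.99)

# Cutoff taken from the text: keep rides of at most 1.25 hours.
MAX_DURATION_HR = 1.25
kept = rides[rides["duration_hr"] <= MAX_DURATION_HR]

print(f"99th percentile: {p99:.2f} hr, rows kept: {len(kept)} of {len(rides)}")
```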

2.b Member Age Outlier Check

99% of members are younger than 64.

Members older than 60 are removed (the user may have entered a random age).

2.c Distance Outlier Check

Almost 5% of rides cover less than 0.5 km.

Rides shorter than 0.5 km are removed (the start and end may be the same point recorded with slightly different longitude and latitude).

Almost 99% of rides cover less than 5 km.

Rides longer than 5 km are removed (the bike may have been stolen).

2.d Speed Outlier Check

Roughly 1% to 5% of rides have a speed between 2 and 5 km/hr.

Rides slower than 3 km/hr are removed (the user may have been walking beside the bike).

Discuss the distribution(s) of your variable(s) of interest. Were there any unusual points? Did you need to perform any transformations?

Duration, Age, Distance, Speed

1. Duration: rides longer than 1.25 hr are outliers.
(the user may have forgotten to end the trip)
2. Age: users older than 60 years are outliers.
(the user may have entered a random birth year)
3. Distance: rides shorter than 0.5 km or longer than 5 km are outliers.
(the first are likely the same point recorded with different latitude and longitude; in the second the bike may have been stolen)
4. Speed: rides slower than 3 km/hr are outliers.
(the user may have been walking beside the bike)
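The four rules above can be applied together with a single boolean mask. The sketch below assumes the column names from the feature list and uses a hypothetical helper, `drop_outliers`, with made-up sample rows; the thresholds are the ones stated in the text.

```python
import pandas as pd

# (low, high) bounds per column; None means "no bound on that side".
LIMITS = {
    "duration_hr": (None, 1.25),
    "member_age": (None, 60),
    "distance_km": (0.5, 5.0),
    "speed_km/hr": (3.0, None),
}


def drop_outliers(df, limits=LIMITS):
    """Keep only the rows that fall inside every bound in limits."""
    mask = pd.Series(True, index=df.index)
    for column, (low, high) in limits.items():
        if low is not None:
            mask &= (df[column] >= low)
        if high is not None:
            mask &= (df[column] <= high)
    return df[mask]


# Made-up rows: index 1 is too long, 2 too old, 3 too short and too slow.
sample = pd.DataFrame({
    "duration_hr": [0.20, 2.00, 0.30, 0.50, 0.15],
    "member_age": [32, 28, 75, 40, 25],
    "distance_km": [2.0, 3.0, 1.5, 0.1, 1.8],
    "speed_km/hr": [10.0, 1.5, 5.0, 0.2, 12.0],
})
clean = drop_outliers(sample)
print(clean)
```

Keeping the limits in one dictionary makes it easy to see, and later adjust, every cutoff in a single place.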

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

1. Dataframe: Replace infinite values with null and remove all null values

2. Duration: Convert duration from seconds to hours

3. Distance: Compute the distance between the latitude and longitude of the start and end points

4. Distance: Remove rides with a distance equal to 0 (the start and end points are the same)

5. Speed: Calculate speed (km/hr) from the duration (hr) and the distance (km)

6. Birth Year: Convert birth year to int

7. Age: Calculate the age of the member

8. Age Group: Assign an age group to make comparison easier

9. Hour, Day, Weekday, Month, Quarter, Season: Extract these to investigate the distribution per period
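The duration, distance, speed, and null-handling steps from the list above might look like the sketch below. The text does not say which distance formula was used, so the haversine (great-circle) formula here is an assumption, as are the helper name `haversine_km` and the sample coordinates.

```python
import numpy as np
import pandas as pd

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between points given in decimal degrees."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))


# Made-up trips: row 1 starts and ends at the same point, row 2 has zero duration.
trips = pd.DataFrame({
    "duration_sec": [600, 900, 0],
    "start_station_latitude": [37.7749, 37.8044, 37.3382],
    "start_station_longitude": [-122.4194, -122.2712, -121.8863],
    "end_station_latitude": [37.7849, 37.8044, 37.3482],
    "end_station_longitude": [-122.4094, -122.2712, -121.8863],
})

trips["duration_hr"] = trips["duration_sec"] / 3600
trips["distance_km"] = haversine_km(
    trips["start_station_latitude"], trips["start_station_longitude"],
    trips["end_station_latitude"], trips["end_station_longitude"],
)

# Drop rides whose start and end points coincide.
trips = trips[trips["distance_km"] > 0].copy()

# Zero durations produce infinite speeds, which are turned into nulls and dropped.
trips["speed_km/hr"] = trips["distance_km"] / trips["duration_hr"]
trips = trips.replace([np.inf, -np.inf], np.nan).dropna()
print(trips[["duration_hr", "distance_km", "speed_km/hr"]])
```

Note that this measures the straight-line distance between stations, not the route actually ridden, so the derived speeds are lower bounds.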

3. Bivariate Exploration

3.a Duration Distribution

Most rides last between 300 and 600 seconds.

3.b Speed Distribution

Most rides have a speed between 11 and 12 km/hr.

3.c Age Distribution

Riders aged 32 have the highest ride count.

Riders in their 30s have the highest ride count, followed by those in their 20s.

3.d Start-End Hours Distribution

8 AM to 9 AM and 5 PM to 6 PM have the highest ride start counts (most likely because people use the bikes to commute to and from work).
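A count of ride starts per hour, which is what the observation above rests on, can be sketched with `value_counts`. The hours below are toy values, not the real data.

```python
import pandas as pd

# Toy start hours standing in for the start_hour column.
start_hours = pd.Series([8, 8, 8, 17, 17, 9, 12, 17, 8, 23], name="start_hour")

# Rides per hour, ordered by hour of day for plotting.
rides_per_hour = start_hours.value_counts().sort_index()

# The two busiest start hours.
busiest = rides_per_hour.nlargest(2).index.tolist()
print(rides_per_hour)
print("Busiest hours:", busiest)
```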

3.e Days Distribution

The 28th has the highest ride start count.

3.f Weekday Distribution

Thursday has the most rides of any weekday.

3.g Season, Quarter and Month Distribution

The data covers only one month (February 2019).

3.h Gender Distribution

Male riders have the highest ride count.

3.i User Type Distribution

Subscribers account for 91.1% of the rides.
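A percentage share like the one above can be computed with a normalized `value_counts`. The series below is a toy sample chosen to give round numbers, so it does not reproduce the 91.1% figure.

```python
import pandas as pd

# Toy user_type column: nine subscribers and one customer.
user_type = pd.Series(["Subscriber"] * 9 + ["Customer"], name="user_type")

# Share of rides per user type, as a percentage rounded to one decimal.
share_pct = user_type.value_counts(normalize=True).mul(100).round(1)
print(share_pct)
```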

3.j Location and Route Distribution

Ride locations: San Francisco, Oakland and San Jose

The longest ride distance is 5 km, from Fell St at Stanyan St to Folsom St at 3rd St.

The shortest ride distance is 0.5 km, from North Berkeley BART Station to West St at University Ave.


Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

Time

  1. Most rides last between 300 and 600 seconds.
  2. 8 AM to 9 AM and 5 PM to 6 PM have the highest ride start counts (most likely because people use the bikes to commute to and from work).
  3. The 28th has the highest ride start count.
  4. Thursday has the most rides of any weekday.
  5. The data covers only one month (February 2019).

Speed

  1. Most rides have a speed between 11 and 12 km/hr.

Age

  1. Riders aged 32 have the highest ride count.
  2. Riders in their 30s have the highest ride count, followed by those in their 20s.

Gender

  1. Male riders have the highest ride count.

User Type

  1. Subscribers account for 91.1% of the rides.

Location

  1. Ride locations: San Francisco, Oakland and San Jose
  2. The longest ride distance is 5 km, from Fell St at Stanyan St to Folsom St at 3rd St.
  3. The shortest ride distance is 0.5 km, from North Berkeley BART Station to West St at University Ave.
  4. The most popular route is 1.3 km, from Berry St at 4th St to San Francisco Ferry Building (Harry Bridges Plaza).
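The route findings above come down to grouping rides by start and end station. A minimal sketch follows; the rows are toy data built from the station names and distances quoted in the text, and the ride counts are invented.

```python
import pandas as pd

# Toy rides: three on one route, one each on two others.
trips = pd.DataFrame({
    "start_station_name": ["Berry St at 4th St"] * 3
    + ["Fell St at Stanyan St", "North Berkeley BART Station"],
    "end_station_name": ["San Francisco Ferry Building (Harry Bridges Plaza)"] * 3
    + ["Folsom St at 3rd St", "West St at University Ave"],
    "distance_km": [1.3, 1.3, 1.3, 5.0, 0.5],
})

# One row per route with its ride count and mean distance, busiest first.
routes = (
    trips.groupby(["start_station_name", "end_station_name"])
    .agg(rides=("distance_km", "size"), mean_distance_km=("distance_km", "mean"))
    .sort_values("rides", ascending=False)
)

top_route = routes.index[0]
print("Most popular route:", " -> ".join(top_route))
```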

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

Comparing user type with age group shows that there are almost no customers in their 10s or 60s.
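One way to make this kind of comparison is a cross-tabulation of age group against user type. The rows below are made up purely to show the shape of the table.

```python
import pandas as pd

# Toy riders; the real table has one row per ride.
riders = pd.DataFrame({
    "user_type": ["Subscriber", "Subscriber", "Customer",
                  "Subscriber", "Customer", "Subscriber"],
    "age_group": ["30s", "20s", "30s", "60s", "20s", "10s"],
})

# Ride counts for every (age group, user type) pair; missing pairs show as 0.
counts = pd.crosstab(riders["age_group"], riders["user_type"])
print(counts)
```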

4. Multivariate Exploration

4.a User Type Vs. Gender

Male subscribers have the highest ride count.

4.b User Type Vs. Gender Vs. Age Group

Male subscribers have the most rides, especially those in the 20s and 30s age groups.

4.c Gender Vs. Weekday Vs. Duration, Distance and Speed

Generally, ride duration increases toward the weekend for every gender.

Female riders have the longest ride durations and male riders the shortest.

Riders in the Other category cover the longest distance, followed by female riders, which is consistent with their longer durations.

Both male and female riders cover more distance on Thursday and Friday and less on the weekend.

Male riders are almost always the fastest.

Male, female and Other riders all ride slower on the weekend.

4.d User Type Vs. Weekday Vs. Duration, Distance and Speed

Customers have the longest durations.

Both user types ride for longer on the weekend.

Customers cover the longest distances.

Both user types cover less distance on the weekend.

Both user types ride slower on the weekend.

Subscribers have the highest speed.

4.e Age Vs. Weekday Vs. Duration, Distance and Speed

Riders in their 10s have the slowest riding speed (more time and less distance).

Riders in their 60s rank second in duration and first in distance and speed.

Duration decreases on the weekend for every age group except the 10s.

Speed decreases on the weekend for every age group except the 60s.
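Comparisons like those in sections 4.c to 4.e can be summarized with a pivot table of a mean metric per weekday and group. The helper `mean_by_weekday` and the sample rows are hypothetical; the same call would work with `member_gender` or `age_group` as the grouping column.

```python
import pandas as pd

WEEKDAY_ORDER = ["Monday", "Tuesday", "Wednesday", "Thursday",
                 "Friday", "Saturday", "Sunday"]


def mean_by_weekday(df, group_col, value_col):
    """Mean of value_col per weekday (rows) and group (columns)."""
    table = df.pivot_table(index="weekday", columns=group_col,
                           values=value_col, aggfunc="mean")
    # Put weekdays in calendar order and drop days with no rides at all.
    return table.reindex(WEEKDAY_ORDER).dropna(how="all")


# Made-up rides on two weekdays.
sample = pd.DataFrame({
    "weekday": ["Thursday", "Thursday", "Saturday", "Saturday", "Thursday"],
    "user_type": ["Subscriber", "Customer", "Subscriber", "Customer", "Subscriber"],
    "duration_hr": [0.15, 0.30, 0.20, 0.40, 0.25],
})
summary = mean_by_weekday(sample, "user_type", "duration_hr")
print(summary)
```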

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

User Type Vs. Gender

1. Male subscribers have the highest ride count.

User Type Vs. Gender Vs. Age Group

1. Male subscribers have the most rides, especially those in the 20s and 30s age groups.

Gender Vs. Weekday Vs. Duration, Distance and Speed

1. Generally, ride duration increases toward the weekend for every gender.

2. Female riders have the longest ride durations and male riders the shortest.

3. Riders in the Other category cover the longest distance, followed by female riders, which is consistent with their longer durations.

4. Both male and female riders cover more distance on Thursday and Friday and less on the weekend.

5. Male riders are almost always the fastest.

6. Male, female and Other riders all ride slower on the weekend.

User Type Vs. Weekday Vs. Duration, Distance and Speed

1. Customers have the longest durations.

2. Both user types ride for longer on the weekend.

3. Customers cover the longest distances.

4. Both user types cover less distance on the weekend.

5. Both user types ride slower on the weekend.

6. Subscribers have the highest speed.

Age Vs. Weekday Vs. Duration, Distance and Speed

1. Riders in their 10s have the slowest riding speed (more time and less distance).

2. Riders in their 60s rank second in duration and first in distance and speed.

3. Duration decreases on the weekend for every age group except the 10s.

4. Speed decreases on the weekend for every age group except the 60s.

Were there any interesting or surprising interactions between features?

1. Customers cover the longest distances.

2. Subscribers have the highest speed.